ABSTRACT

Suppose you find the same username on different online
services: what is the probability that these usernames
refer to the same physical person? This work addresses
what appears to be a fairly simple question, which has
many implications for anonymity and privacy on the
Internet. One possible way of estimating this probability
would be to look at the public information associated with
the two accounts and try to match it. However, for
most services, this information is chosen by the users
themselves and is often very heterogeneous, possibly
false and difficult to collect. Furthermore, several websites
do not disclose any additional public information
about users apart from their usernames (e.g., discussion
forums or blog comments); nonetheless, they might
contain sensitive information about users.

This paper explores the possibility of linking user
profiles only by looking at their usernames. The intuition
is that the probability that two usernames refer to the 
same physical person strongly depends on the "entropy" 
of the username string itself. Our experiments, based 
on crawls of real web services, show that a significant 
portion of the users' profiles can be linked using their 
usernames. To the best of our knowledge, this is the 
first time that usernames are considered as a source of 
information when profiling users on the Internet. 

1. INTRODUCTION 

Online profiling is a serious threat to users' privacy. In
particular, the ability to trace users by linking multiple
identities from different public profiles may be of great
interest to profilers, advertisers and the like. Indeed, it
might be possible to gather information from different
online services and combine it to sharpen the knowledge
of users' identities. This knowledge may then be exploited
to perform efficient social phishing or targeted spam, and
might also be used by advertisers or future employers
seeking information. As a judge of the US Supreme Court
colloquially put it in a recent case about
warrantless GPS tracking, "When it comes to privacy,
the whole may be more revealing than its parts."

Recent works [1, 3] showed how it is possible to retrieve
users' information from different online social networks
(OSNs). All of these works mainly exploit flaws in the
OSN's API design (e.g., Facebook friend search). Other
approaches [17] use the topology of social network friend
graphs to de-anonymize their nodes.

In this paper, we propose a novel methodology that
uses usernames (a piece of information that is easy to
collect) rather than social graphs to tie together users'
online identities. Our technique only assumes knowledge
of usernames and is widely applicable to all web services
that publicly expose usernames. Our purpose is to show
that users' pseudonyms allow simple, yet efficient tracking
of online activities.

The recent activities of scraping services illustrate well the
threats introduced by the ability to match up users'
pseudonyms on different social networks [2]. For instance,
PeekYou.com has lately applied for a patent for a way
to match people's real names to pseudonyms they use
on blogs, OSN services and online forums [14]. The
methodology relies on public information collected for
a user that might help in matching different online
identities. The algorithm empirically assigns weights to
each piece of collected information so as to deem different
identities to be the same. However, the algorithm is ad hoc
and not robust to false or mismatching information.
In light of these recent developments, it is desirable that 
the research community investigates the capabilities and 
limits of these profiling techniques. This will, in turn, 
allow for the design of appropriate countermeasures to 
protect users' privacy. 

In general, profiling unique identities from multiple
public profiles is a challenging task, as information from
public profiles is often incorrect, misleading or altogether
missing [11]. Techniques designed for the purpose of
profiling need to be robust to these occurrences.

Contributions. 

The contributions of this paper are manifold. First,
we introduce the problem of linking multiple online identities
relying only on usernames. Second, we devise an
analytical model to estimate the uniqueness of a username,
which can in turn be used to assign a probability
that a single username, from two different online services,
refers to the same user. Based on language models and
Markov Chain techniques, our model validates an intuitive
observation: usernames with low "entropy" (or, to
be precise, Information Surprisal) have a higher probability
of being picked by multiple persons, whereas
higher-entropy usernames are very unlikely to be picked
by multiple users and refer, in the vast majority of
cases, to unique users.

Third, we extend this model to cases where usernames
differ across online services. In essence,
given two usernames our technique returns the probability
that these usernames refer to the same user, and
thus makes it possible to effectively trace users' identities across
multiple web services using their usernames only. While
we acknowledge that our technique cannot trace users
that choose unrelated usernames on purpose, experimental
data shows that users tend to choose closely related
usernames on different services. Finally, by applying our
technique to subsets of usernames we extracted from
real-world scenarios, we validate and discuss our technique in
the wild.

We envision several possible uses of these techniques,
not all of them malicious. In particular, users might
use our tool to test how unique their username is and,
therefore, take appropriate decisions in case they wish to
stay anonymous. To this end we provide an online
tool that can help users choose appropriate usernames by
measuring how unique and traceable their usernames are.
The tool is available at
http://planete.inrialpes.fr/projects/how-unique-are-your-usernames

Paper organization. 

In Section [2] we overview the related work on privacy
and introduce the machine learning tools used in our
analysis. In Section [4] we introduce our measure to
estimate the uniqueness of usernames, and in Section [5]
we extend our model to compute the probability that
two usernames refer to the same person and validate it
using the dataset we collected from eBay and Google
(Section [3]). Different techniques are introduced and
evaluated. Finally, in Section [7] we discuss the potential
impact of our proposed techniques and present some
possible countermeasures.

2. RELATED WORK AND BACKGROUND

2.1 Related Work 

Tracking OSN users.

In [11] the authors propose to use what they call the
online social footprint to profile users on the Internet.
This footprint would be the collection of all the little pieces
of information that each user leaves on web services and
OSNs. While the idea is promising, this appears to be
only a preliminary work and no model, implementation
or validation is given.



Similarly, in [4], Bilge et al. discuss how to link the
membership of users to two different online social networks.
Noticing that there might be discrepancies in
the information provided by a single user in two social
networks, the authors rely on Google search results to
decide the equivalence of selected fields of interest (as
for assigning uniqueness of a user). Typically, the input
of their algorithm is the name and surname of a user,
augmented by the education/occupation as provided
in two different social networks. They use such
input to start two separate Google searches, and if both
appear in the top three hits, these are deemed to
be equivalent. The corresponding users are consequently
identified as a single user on both social networking sites.
Bilge et al.'s work illustrates well how challenging the
process of identifying users from multiple public profiles
is. Despite the use of a customized crawler and parser
for each social network, the heterogeneity of the information
provided by users (if correct) makes the process hard
to deploy, if not infeasible, at a large scale.

Record linkage. 

Record linkage (RL), or alternatively Entity Resolution
[9, 5], refers to the task of finding records that
refer to the same entity in two or more databases. This
is a common task when databases of user records are
merged. For example, after two companies merge they
might also want to merge their databases and find duplicate
entries. Record linkage is needed in this case
if there are no unique identifiers available (e.g., social
security numbers).

In RL terminology, two records that have been matched
are said to be linked. For the purpose of linking profiles
using usernames, we test several RL techniques and
compare their performance to the ones introduced in
this paper. However, differently from the record linkage
problem, in our setup a complete match of two different
usernames does not necessarily indicate a positive
identification. Furthermore, the application of record
linkage techniques to link public online user profiles is
novel to the best of our knowledge and presents several
challenges of its own.

Tracking browsers across sessions. 

Another related problem is the fingerprinting of web
clients. Usually, ad servers set unique cookies on the
browsers to allow for easy tracking of users between
HTTP requests. A simple and straightforward practice
on browsers to limit the risk of re-identification is to
restrict or disable the use of third-party cookies. However,
recent research [8] has shown that different browser
installations might contain enough unique features or
"entropy" to allow for re-identification even in the absence
of long-lived unique identifiers like cookies.

De-anonymizing sparse database and graph data. 

[17] proposes an identification algorithm targeting
anonymized social network graphs. The main idea of
this work is to de-anonymize an online social graph based
on information acquired from a secondary social network
to which users are also known to belong. Similarities identified
in the network topologies of both services then allow
users belonging to the anonymized graph to be identified.

2.2 Background 

2.2.1 Information Surprisal 

Self-information, or Information Surprisal, measures
the amount of information associated with a specific outcome
of a random variable. If $X$ is a random variable
and $x$ one possible outcome, we denote the information
surprisal of $x$ as $I(x)$. Information Surprisal is computed
as $I(x) = -\log_2 P(x)$ and hence depends only
on the probability of $x$. The smaller the probability of
$x$, the higher the associated surprisal. Entropy, on
the other hand, measures the information associated with
a random variable (regardless of any specific outcome)
and is denoted $H(X)$. Entropy and Surprisal are deeply related,
as entropy can be seen as the expected value of
the information surprisal, $H(X) = E[I(X)]$. Both are
usually measured in bits.

Suppose there exists a discrete random variable that
models the distribution of usernames in a population;
call this variable $U$. The random variable $U$ follows a
probability mass function $P_U$ that associates to each
username $u$ a probability $P(u)$. In this context, the
information surprisal $I(u)$ is the amount of identifying
information associated with a username $u$. Every bit of
surprisal adds one bit of identifying information and thus
cuts in half the population in which $u$ might lie.

If we assume that there are $W$ users in a population,
then a username $u$ uniquely identifies a user in the population
if $I(u) > \log_2(W)$. In this sense, information
surprisal gives a measure of the "uniqueness" of a username
$u$, and it is the measure we are going to use in this
work. The challenge lies in estimating the probability
$P(u)$, which we will address in Section [4].
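
The uniqueness criterion can be illustrated with a few lines of Python. This is only a sketch: the function names are hypothetical, and the probability passed in is assumed to come from some estimator of $P(u)$, which is the subject of Section [4].

```python
import math


def surprisal_bits(p: float) -> float:
    """Information surprisal I(u) = -log2 P(u), in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


def identifies_uniquely(p: float, population: int) -> bool:
    """True when I(u) > log2(W) for a population of W users."""
    return surprisal_bits(p) > math.log2(population)
```

For instance, a username with estimated probability $2^{-40}$ carries 40 bits, which exceeds the roughly 30 bits needed to single out one user among a billion.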

Our treatment of information surprisal and its associ- 
ation to privacy is similar to the one recently suggested 
in [8] in the context of fingerprinting browsers. 

3. THE DATASET 

Our study was conducted on several different lists of
usernames: (a) a list of 3.5 million usernames gathered
from public Google profiles; (b) a list of 6.5 million
usernames gathered from eBay accounts; (c) a list of
about 16000 usernames gathered from our research
center's LDAP directory; (d) two large username lists
found online and used in a previous study by Dell'Amico
et al. [7]: a "finnish" dataset and a list of usernames
collected from MySpace.

The "finnish" dataset comes from a list publicly disclosed
in October 2007^1. The dataset contains usernames,
email addresses and passwords of almost 79000
user accounts. This information was collected from
the servers of several Finnish web forums, most likely
by hacking. The MySpace dataset comes from a phishing
attack that set up a fake MySpace login web page. This
data was disclosed in October 2006 and contains
more than 30000 unique usernames.

^1 http://www.f-secure.com/weblog/archives/00001293.html



We used these datasets in three ways. First,
we used the combined list of 10 million usernames (from
eBay and Google) to train the Markov Chain model
needed for the probability estimations. Second, we used
the information on Google profiles to gather ground truth
evidence and test our technique to link multiple public
profiles even in case of slightly different usernames (Section
[5]). Third, we used all the datasets to characterize
username uniqueness and depict information surprisal
distributions as seen in the wild. Our objective here is to
validate our techniques on several datasets, where users
come from widely distributed locations and may have
different habits regarding web service usage and username
choices.

Notably, a feature of Google Profiles allowed us to 
build a ground truth we used for validation purposes. 
In fact, users on Google Profiles can optionally decide 
to provide a list of their other accounts on different 
OSNs and web services. This provided us with a ground 
truth, for a subset of all profiles, of linked accounts and 
usernames. 

In our experiments we observed that web services differ
significantly in their username policies. However, almost
all services share a common alphabet of letters and
numbers. Analyzing our most complete set of 10 million
distinct usernames, it appears clear that 85% of the users
choose alphanumeric-only usernames, which thus comply
with all username policies. This fact is of interest when
evaluating the applicability of the techniques explained
in this work.

4. ESTIMATING USERNAME UNIQUENESS

As we explained above, we would like to have a measure
of username uniqueness, which can quantify the
amount of identifying information each username carries.
Information Surprisal is a measure, expressed in bits,
that serves this purpose. However, in order to compute
the Information Surprisal associated with usernames, we
need a way to estimate the probability P(u) for each
username u.

A naive way to estimate $P(u)$, given a dataset of
usernames coming from different services, would be to
use Maximum Likelihood Estimation (MLE). If we have
$N$ usernames, then we can estimate the probability of
each username $u$ as $\frac{count(u)}{N}$ if $u$ belongs to our dataset,
and $0$ otherwise, where $count(u)$ is simply the number
of occurrences of $u$ in the sample. In this case we are
assigning maximum probability to the observed samples
and zero to all the others. This approach has several
drawbacks, but the most severe is that it cannot be used to
give any estimate for the usernames not in the sample.
Furthermore, the estimate it gives is very coarse.
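
The limitation of the naive estimator is easy to see in code. The following minimal sketch (function names are illustrative, not part of our tool) assigns count(u)/N to observed usernames and zero to everything else:

```python
from collections import Counter


def mle_probabilities(usernames):
    """Map each observed username u to count(u) / N."""
    counts = Counter(usernames)
    n = len(usernames)
    return {u: c / n for u, c in counts.items()}


def mle_probability(model, username):
    """Unseen usernames receive probability zero under MLE."""
    return model.get(username, 0.0)
```

Any username absent from the sample gets probability zero, so its surprisal is undefined; this is the motivation for a model that generalizes to unseen strings.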

Instead, we would like to have an estimator
that allows us to assign probabilities to
usernames we have never encountered. Markov Chains
have been successfully used to extrapolate knowledge
of human language from small corpora of text. In our
case, we apply Markov Chain techniques to usernames
to estimate their probability.

4.1 Estimating username probabilities with Markov Chains

Markov models are successfully used in many machine
learning techniques that need to predict human-generated
sequences of words, as in speech recognition. In
a very common machine learning problem, one is faced
with the challenge of predicting the next word in a sentence.
If, for example, the sentence is "The quick brown
fox", the word jumps would be a more likely candidate
than car. This problem is usually referred to as the Shannon
Game, following Shannon's seminal work on the topic [18].
This task is usually tackled using Markov Chains and
modeling the probability of the word jumps as depending
on a number of words preceding it.

In our scenario, the same technique can be used to 
estimate the probability of username strings instead of 
sentences. For example, if one is given the beginning of 
a username like sara, it is possible to predict that the 
next character in the username will likely be h. Notably 
Markov-Chain techniques have been successfully used to 
build password crackers [16] and analyse the strength of 
passwords [7]. 

Without loss of generality, the probability of a given
string $c_1, \ldots, c_n$ can be written as $\prod_{i=1}^{n} P(c_i \mid c_1, \ldots, c_{i-1})$.
In order to make the calculation possible, a Markovian assumption
is introduced: to compute the probability of
the next character, only the previous $k$ characters are
considered. This assumption is important because it simplifies
the problem of learning the model from a dataset.
The probability of any given username can be expressed
as:

$$P(c_1, \ldots, c_n) = \prod_{i=1}^{n} P(c_i \mid c_{i-k}, \ldots, c_{i-1})$$

To use Markov Chains for our task we need to estimate,
in a learning phase, the model parameters (the
conditional probabilities) using a suitable dataset. In our
experiments we used the database of approximately 10
million usernames populated by collecting Google public
profiles and eBay user accounts (see Section [3]).

In general, the conditional probabilities are computed
by counting the number of occurrences of the $n$-gram that ends with
character $c_i$ and dividing it by the number of occurrences of the
$(n-1)$-gram that precedes it (that is, the same $n$-gram
without the character $c_i$), where an $n$-gram is simply a
sequence of $n$ characters.

Markov Chain techniques benefit from the use of
longer n-grams, because longer "histories" can be captured.
However, longer n-grams result in an exponential
decrease of the number of samples for each n-gram. In
our experiments we used 5-grams for the computation
of conditional probabilities.

Once we have calculated $P(u)$, we can trivially compute
the information surprisal of $u$ as $-\log_2 P(u)$.
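
The estimation procedure can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the tool used in our experiments: the padding symbols, the add-one smoothing and the `alphabet_size` parameter are choices made only for this example, so that unseen n-grams do not receive zero probability.

```python
import math
from collections import defaultdict

# Hypothetical padding symbols, chosen for this sketch only.
START, END = "\x02", "\x03"


class CharMarkovModel:
    """Character n-gram model: P(c_i | previous n-1 characters)."""

    def __init__(self, n=5, alphabet_size=64):
        self.n = n
        self.alphabet_size = alphabet_size  # assumed, used by smoothing
        self.ngram_counts = defaultdict(int)    # history + character
        self.history_counts = defaultdict(int)  # history alone

    def _pairs(self, username):
        # Pad so that the first characters also have a full history.
        padded = START * (self.n - 1) + username + END
        for i in range(self.n - 1, len(padded)):
            yield padded[i - self.n + 1:i], padded[i]

    def train(self, usernames):
        for u in usernames:
            for history, char in self._pairs(u):
                self.ngram_counts[history + char] += 1
                self.history_counts[history] += 1

    def probability(self, username):
        p = 1.0
        for history, char in self._pairs(username):
            # Add-one smoothing (an assumption of this sketch).
            num = self.ngram_counts.get(history + char, 0) + 1
            den = self.history_counts.get(history, 0) + self.alphabet_size
            p *= num / den
        return p

    def surprisal(self, username):
        return -math.log2(self.probability(username))
```

With `n=5` each character is conditioned on the four preceding ones, matching the 5-grams mentioned above; strings made of frequent n-grams receive a lower surprisal than strings made of rare ones.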




Figure 1: Surprisal distribution for eBay and 
Google usernames 




Figure 2: Surprisal distribution for other services



In Appendix [8] we give a different, yet related, probabilistic
explanation of username uniqueness.

4.2 Experiments 

We conducted experiments to estimate the surprisal of
the usernames in our dataset and hence how unique and
identifying they are. As explained above, our Markov Chain
model was trained using the combined 10 million
usernames gathered from eBay and Google. The dataset
was used for both training and testing by using
leave-one-out cross validation. Essentially, when computing
the probability of a username u using our Markov Chain
tool, we excluded u from the model's occurrence counts.
This way, the probability estimation for u depended on
all the other usernames but u.
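
One way to realize this exclusion without retraining the model for every username is to subtract the username's own n-gram counts at scoring time. The sketch below assumes a small trigram model with add-one smoothing and hypothetical padding symbols; it removes a single occurrence of u, which is an assumption of this example.

```python
from collections import Counter

START, END = "\x02", "\x03"  # hypothetical padding symbols


def ngram_pairs(username, n=3):
    """Return (history, next character) pairs for a padded username."""
    padded = START * (n - 1) + username + END
    return [(padded[i - n + 1:i], padded[i])
            for i in range(n - 1, len(padded))]


def train(usernames, n=3):
    ngrams, histories = Counter(), Counter()
    for u in usernames:
        for history, char in ngram_pairs(u, n):
            ngrams[history + char] += 1
            histories[history] += 1
    return ngrams, histories


def loo_probability(username, ngrams, histories, n=3, alphabet_size=64):
    """Probability of username with one occurrence of it left out."""
    own = ngram_pairs(username, n)
    own_ngrams = Counter(h + c for h, c in own)
    own_histories = Counter(h for h, _ in own)
    p = 1.0
    for history, char in own:
        # Remove the username's own contribution, then smooth (assumed).
        num = ngrams[history + char] - own_ngrams[history + char] + 1
        den = histories[history] - own_histories[history] + alphabet_size
        p *= num / den
    return p
```

A username that occurs only once is therefore scored exactly as if it had never been seen during training.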

We computed information surprisal for all the usernames
in our dataset and the results are shown in Figure
[1]. The entropy of both distributions is higher than 35
bits, which would suggest that, on average, usernames
are extremely unique identifiers.

Notice the overlap in the distributions, which might indicate
that our surprisal measure is stable across different
services. Notably, the two services have largely different
username creation policies, with eBay accepting usernames
as short as 3 characters from a wider alphabet and
Google imposing more restrictions on users. Also, the
account creation interfaces vary greatly across the two
services. In fact, Google offers a feature that suggests
usernames to new users derived from first and last names.
This is probably the reason why Google usernames have
a higher Information Surprisal (see Figure [3]). It must
also be noted that both services have hundreds of millions
of reported users. This raises the entropy of both
distributions: as the number of users increases, they are
forced to choose usernames with higher entropies to find
available ones. Overall it appears clear that usernames
constitute a highly identifying piece of information that
can be used to track users across websites.

Figure 3: Cumulative distribution function for
the surprisal of all the services

In Figure [2] we plot information surprisal for three 
datasets gathered from different services. This graph is 
motivated by our need to understand how much surprisal 
varies across services. 

The results are similar to the ones obtained for eBay
and Google usernames. The Finnish list is noteworthy:
these usernames come from different Finnish forums and
most likely belong to Finnish users. However, Suomi (the
official language of Finland) shares almost no common
roots with Romance or Anglo-Saxon languages. This can
be seen as a good indication of the stability of our
estimation across different languages.

Furthermore, notice that the dataset coming from
our own research center (INRIA) has a higher surprisal
than all the other datasets. While there are a number of
possible explanations for this, the most likely one
comes from the username creation policies in place, which
require usernames to be the concatenation of first and
last name. The high surprisal comes despite the fact that
the center has only around 16000 registered usernames,
so lack of availability does not pressure users to choose
more unique usernames.

Comparing the distributions of information surprisal
of our different datasets is enlightening, as illustrated
in Figure [3]. This confirms that usernames collected
from the INRIA center exhibit the highest information
surprisal, with almost 75% of usernames having a surprisal
higher than 40 bits. We also observe that the Google
and MySpace CDF curves closely match. In all cases, it is
worth noticing that the fraction of usernames that
exhibit an information surprisal below 30 bits is at most
25% and, for the best dataset, less than 5%.
This shows that a vast majority of users from our datasets
can be uniquely identified among a population of 1 billion
users, relying only on their usernames.

5. USERNAME COUPLES LINKAGE 

The technique explained above can only estimate the 
uniqueness of a single username across multiple web 
services. However, there are cases in which users, either 
willingly or forced by availability, decide to change their 
username. 

We would like to know whether users change their
usernames in any predictable and traceable way.
Figures [4(a)] and [4(b)] plot the distribution of the
Levenshtein (or edit) distance for username couples. In
particular, Figure [4(a)] depicts the distribution for $10^4$
username couples that we can verify to belong to single users
(we call this set $L$ for linked), using our dataset. On
the other hand, Figure [4(b)] shows the distribution for a
sample of random username couples that do not belong
to a single user (we call this set $NL$ for non-linked). In
the first case the mean distance is 4.2 and the standard
deviation is 2.2; in the second case the mean Levenshtein
distance is 12 and the standard deviation is 3.1. Clearly,
linked usernames are much closer to each other than
non-linked ones. This suggests that, in many cases,
users choose usernames that are related to each other.
The difference between the two distributions is remarkable,
so it might be possible to estimate the probability that
two different usernames are used by the same person or,
in record linkage terminology, to link different usernames.

However, as illustrated in Section [4] and differently
from record linkage, an almost perfect username match
does not always indicate that the two usernames belong
to the same person. The probability that two usernames,
let's say sarah and sarah2, are linked (we call
it $P_{same}(sarah, sarah2)$) should depend on:

1. how much 'information' there is in the common
part of the usernames (in this case sarah) and,

2. how likely it is that a user will change one username
into the other (in this case the addition of a 2 at
the end).

We will show two different novel approaches to solving
this problem. The first approach uses a combination
of Markov Chains and a Levenshtein Distance weighted
by probabilities. The second approach makes use of
the theory and techniques used in information retrieval
to compute document similarity, specifically
TF-IDF.

We compare these two techniques to record linkage 
techniques for a base-line comparison. Specifically we 
use string-only metrics like the normalized Levenshtein 
Distance and Jaro distance to link username couples. 

Method 1: Linkage using Markov-Chains and LD. 

First of all, we need to compute the probability of a
certain username $u_1$ being changed into $u_2$. We denote
this probability as $P(u_2 \mid u_1)$. Going back to our original
example, $P(sarah2 \mid sarah)$ is equal to the probability of
adding the character 2 at the end of the string sarah.
This same principle can be extended to deletion and
substitution. In general, if two strings $u_1$ and $u_2$ differ
by a sequence of basic operations $o_1, o_2, \ldots, o_n$, we can
estimate $P(u_2 \mid u_1) = P(u_1 \mid u_2) = p(o_1)\,p(o_2) \cdots p(o_n)$.

(a) Levenshtein distance distribution for linked
username couples (set $L$, $|L| = 10^4$)

(b) Levenshtein distance distribution for non-linked
username couples (set $NL$, $|NL| = 10^4$)

Figure 4: Levenshtein distance distribution
for username couples gathered from 3 million
Google profiles

In order to estimate the probability that usernames $u_1$
and $u_2$ belong to the same person, we need to consider
that there are two different possibilities for how $u_1$ and
$u_2$ were chosen in the first place. The first possibility
is that they were picked independently by two different
users. The second possibility is that they were picked
by the same user, hence they are not independent.

In the former case we can compute $P(u_1 \wedge u_2)$ as
$P(u_1) \cdot P(u_2)$, since we can assume independence. In the
latter case, where the user is the same, $P(u_1 \wedge u_2)$
equals $P(u_1) \cdot P(u_2 \mid u_1)$. Note that using Markov Chains and
our estimate of $P(u_2 \mid u_1)$, we can compute all the
terms involved. Estimating the probability $P_{same}(u_1, u_2)$
is now a matter of estimating and comparing the two
probabilities above.

The formula for $P_{same}(u_1, u_2)$ is derived from the
probability $P(u_1 \wedge u_2)$ using Bayes' Theorem. In fact,
we can rewrite the probability above as $P(u_1 \wedge u_2 \mid S)$,
where the random variable $S$ can take the values 0 or 1:
it is 1 if $u_1$ and $u_2$ belong to the same person and 0
otherwise. Hence, without loss of generality:

$$P(S \mid u_1 \wedge u_2) = \frac{P(u_1 \wedge u_2 \mid S)\,P(S)}{\sum_{S=0,1} P(u_1 \wedge u_2 \mid S)\,P(S)}$$

which leads to $P(S = 1 \mid u_1 \wedge u_2)$ equal to

$$\frac{P(u_1)\,P(u_2 \mid u_1)\,P(S = 1)}{P(u_1)\,P(u_2)\,P(S = 0) + P(u_1)\,P(u_2 \mid u_1)\,P(S = 1)}$$

where $P(S = 1)$ is the probability of two usernames belonging
to the same person, regardless of the usernames.
We can estimate this probability to be $\frac{1}{W}$, where $W$ is
the population size. Conversely, $P(S = 0) = 1 - \frac{1}{W} = \frac{W-1}{W}$. And
so we can rewrite $P_{same}(u_1, u_2)$ as $P(S = 1 \mid u_1 \wedge u_2)$,
equal to

$$\frac{P(u_1)\,P(u_2 \mid u_1)}{W \cdot P(u_1)\,P(u_2)\,\frac{W-1}{W} + W \cdot P(u_1)\,P(u_2 \mid u_1)\,\frac{1}{W}}$$

Please note that when $u_1 = u_2 = u$, so that $P(u_2 \mid u_1) = 1$, the formula
above becomes

$$\frac{1}{(W-1)\,P(u) + 1}$$

which is exactly the same estimate we devised for the
username uniqueness in the Appendix.
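
The computation of Method 1 can be sketched as below, assuming the prior $P(S = 1) = 1/W$ and treating the per-operation probabilities as given inputs. The function names are hypothetical, and how the operation probabilities are estimated is not covered by this example.

```python
import math


def transformation_probability(operation_probs):
    """P(u2 | u1) as the product of the edit-operation probabilities.

    An empty sequence (u1 == u2) yields 1.
    """
    return math.prod(operation_probs)


def p_same(p_u1, p_u2, p_u2_given_u1, population):
    """Posterior probability that u1 and u2 belong to the same person."""
    prior_same = 1.0 / population   # P(S = 1), assumed to be 1/W
    prior_diff = 1.0 - prior_same   # P(S = 0)
    same = p_u1 * p_u2_given_u1 * prior_same
    diff = p_u1 * p_u2 * prior_diff
    return same / (same + diff)
```

Calling `p_same(p, p, 1.0, W)` for identical usernames reduces to $1 / ((W-1)P(u) + 1)$, so rare usernames (small $P(u)$) are linked with probability close to one, while common ones are not.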

Method 2: Linkage using TF-IDF. 

In this case we use a well-known information retrieval
tool called TF-IDF. However, TF-IDF similarity measures
the distance between two documents (or a search
query and a document), which are sets of words.

The term frequency-inverse document frequency (TF-IDF)
is a weight used to evaluate how important a
word is to a document that belongs to a corpus [13]. The
weight assigned to a word increases proportionally to the
number of times the word appears in the document, but its
importance decreases for words that are common in the corpus.

If we have a collection of documents $D$ in which each
document $d \in D$ is a set of terms, then we can compute
the term frequency of term $t_i \in d_j$ as $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$,
where $n_{i,j}$ is the number of times term $t_i$ appears in
document $d_j$. The inverse document frequency of a term
$t_i$ in a corpus $D$ is $idf_i = \log \frac{|D|}{c_i}$, where $c_i$ is the number
of documents in the corpus that contain the term $t_i$.
The TF-IDF is computed as $(tf\text{-}idf)_{i,j} = tf_{i,j} \cdot idf_i$.
The TF-IDF is often used to measure the similarity
between two documents, say $d$ and $d'$, in the following
way: first the TF-IDF is computed over all the terms
in $d$ and $d'$ and the results are stored in two vectors $v$
and $v'$; then the similarity between the two vectors is
computed, for example using a cosine similarity measure
$sim(d, d') = \frac{v \cdot v'}{\|v\|\,\|v'\|}$.

In our case we need to measure the distance between
usernames composed of a single string. The way we
solved this problem is pragmatic: we consider all possible
substrings of size $q$ of a string $u$ to be a document
$d_u$, where $d_u$ can be seen as the building blocks of the
string $u$. The similarity between usernames $u_1$ and $u_2$ is
computed using the similarity measure described above.
This similarity measure is referred to in the literature as
q-gram similarity [19]; however, it has been proposed for
fuzzy string matching in database applications and its
application to online profiling is novel.
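
A minimal sketch of q-gram TF-IDF similarity follows. It assumes the common definitions of term frequency (relative count within the document) and inverse document frequency (logarithm of corpus size over document frequency); the value of `q`, the function names and the handling of unseen q-grams are choices made for this example only.

```python
import math
from collections import Counter


def qgrams(username, q=2):
    """All contiguous substrings of length q (the 'document' d_u)."""
    return [username[i:i + q] for i in range(len(username) - q + 1)]


def build_idf(usernames, q=2):
    """idf of each q-gram over a corpus of usernames."""
    doc_freq = Counter()
    for u in usernames:
        doc_freq.update(set(qgrams(u, q)))
    n_docs = len(usernames)
    return {g: math.log(n_docs / df) for g, df in doc_freq.items()}


def tfidf_vector(username, idf, q=2, default_idf=0.0):
    """Sparse TF-IDF vector; unseen q-grams get default_idf (assumed)."""
    grams = qgrams(username, q)
    counts = Counter(grams)
    total = len(grams)
    return {g: (c / total) * idf.get(g, default_idf)
            for g, c in counts.items()}


def cosine_similarity(v1, v2):
    dot = sum(w * v2.get(g, 0.0) for g, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Because rare q-grams have a high idf, two usernames that share an unusual substring score higher than two usernames that share only very common ones.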

Method 3: String Only Similarity Metrics from 
Record Linkage. 

The Levenshtein (or edit) distance (LD) measures the
similarity between two strings of different or equal length.
It is defined as the minimum number of basic operations
(deletion, insertion and substitution) needed to edit one
string into another. The Levenshtein distance is a useful
tool but its interpretation is not always clear in practice.
For example, consider the case of the usernames alice
and malice, in comparison to the couple vonneumann
and jvonneumann. Both couples have an LD of 1, but
in the latter case the two usernames are clearly more
related than in the former. To cope with these cases a
normalized Levenshtein distance (NLD) is used instead.
While there are different methods used to normalize the
LD between two strings, in our experiments we use
the following formula: $NLD = 1 - \frac{LD(u_1, u_2)}{\max(len(u_1), len(u_2))}$.

Note that the NLD is always a number between 0 and 1,
since the LD can be at most equal to the length of the
longest string. Note also that, for a fixed LD, the longer $u_1$ or $u_2$ are,
the closer the NLD approaches one.
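
Both quantities are straightforward to compute. The sketch below uses the standard dynamic-programming algorithm for the edit distance and assumes the normalization NLD = 1 - LD / max(len(u1), len(u2)); the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(u1: str, u2: str) -> float:
    """NLD in [0, 1], with 1 meaning identical strings."""
    longest = max(len(u1), len(u2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(u1, u2) / longest
```

For the example above, both couples have an LD of 1, but the NLD is 1 - 1/6 for alice and malice and 1 - 1/11 for vonneumann and jvonneumann, so the longer couple is rated as more similar.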

The Jaro distance [12] is yet another measure of similarity
between two strings and it is mainly used in the
area of record linkage. The distance is normalized and
goes from 0 to 1, with 1 indicating an exact match. We
will use it as a baseline comparison with our novel approaches.
However, because of lack of space, we will not
explain it in detail.

5.1 Validation 

Our goal is to assess how accurately usernames can be 
used to link two different accounts. For this purpose we 
design and build a classifier to separate the two sets L 
and NL, respectively of linked usernames and non-linked 
usernames. 

For our tests the ground-truth evidence was gathered
from Google Profiles, and the number of linked username
couples |L| is 10000. In order to fairly estimate the
performance of the classifier in a real world scenario,
we also randomly paired 10000 non-linked usernames to
generate the NL set.

The username couples were separated, shuffled and a 
list of usernames derived from L and NL was constructed. 
The task of the classifier is to re-link the usernames in L 
maximizing the username couples correctly linked while 
linking as few incorrect couples as possible. 

In practice, for each username in the list our program
computed the similarity to every other username and kept
only the link to the single username with the highest
similarity. If this value is above a threshold then the
candidate couple is considered linked, otherwise non-linked.
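The linking step described above can be sketched as follows. This is only an illustration: `link_usernames` is a hypothetical name, and `stand_in_similarity` uses Python's `difflib.SequenceMatcher` as a placeholder for any of the similarity measures discussed in this section, not as one of the methods we evaluated.

```python
from difflib import SequenceMatcher


def stand_in_similarity(u1, u2):
    """Placeholder similarity in [0, 1]; any metric from this section could be used."""
    return SequenceMatcher(None, u1, u2).ratio()


def link_usernames(usernames, similarity, threshold):
    """For each username keep its single most similar candidate, if above threshold."""
    links = []
    for i, u in enumerate(usernames):
        best_score, best_match = -1.0, None
        for j, v in enumerate(usernames):
            if i == j:
                continue  # never compare a list entry with itself
            score = similarity(u, v)
            if score > best_score:
                best_score, best_match = score, v
        if best_match is not None and best_score > threshold:
            links.append((u, best_match, best_score))
    return links
```

Note that every username is compared with every other one, so the running time of this sketch is quadratic in the length of the list.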

5.1.1 Measuring the performance of our binary 
classifier 

Binary classifiers are primarily evaluated in terms
of precision and recall. Precision is defined in terms of
true positives (TP) and false positives (FP) as
$\mathrm{precision} = \frac{TP}{TP + FP}$, and recall takes
into account the true positives compared to false negatives
(FN): $\mathrm{recall} = \frac{TP}{TP + FN}$. Recall is the
proportion of usernames that were correctly classified as
unique (TP) out of all unique usernames (TP + FN). In
addition to those two measures, we will also use accuracy,
defined with the addition of true negatives (TN) as
$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.

In our case, we are interested in finding username
couples that are actually linked (true positives) while
minimizing the number of couples that are linked by
mistake (false positives). Precision for us is a measure
of exactness or fidelity, and higher precision means fewer
profiles linked by mistake. Recall measures how complete
our tool is, that is, the ratio of linked profiles that
are found out of all linked ones. Precision and recall
are usually shown together in a precision/recall graph.
The reason is that they are often closely related: a
classifier with high recall usually has sub-optimal
precision, while one with high precision has lower recall.
An ideal classifier has both precision and recall equal
to 1.

Our classifier looks for potentially matching usernames. 
Once a set of potential matches is identified our scoring 
algorithms are used to calculate how likely it is that 
the two usernames represent the same real identity. By 
using our labeled test data, score thresholds can be 
selected that yield a desired trade-off between recall and 
precision. Figure 5 shows the precision and recall of the
two methods discussed in this paper and known string 
metrics (Jaro and NLD) at various threshold levels. 
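The selection of a score threshold from labeled data can be sketched as follows. The function name and the toy scores in the usage note are hypothetical and serve only to show how each threshold yields one precision/recall point; they are not taken from our experiments.

```python
def precision_recall_curve(scored, thresholds):
    """Compute (threshold, precision, recall) points from labeled scores.

    scored: list of (score, is_linked) tuples, one per candidate couple.
    A couple is predicted as linked when its score is above the threshold.
    """
    total_linked = sum(1 for _, linked in scored if linked)
    curve = []
    for t in thresholds:
        tp = sum(1 for s, linked in scored if s > t and linked)
        fp = sum(1 for s, linked in scored if s > t and not linked)
        # ASSUMPTION: precision is reported as 1.0 when nothing is predicted linked.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / total_linked if total_linked else 0.0
        curve.append((t, precision, recall))
    return curve
```

For instance, with the toy input `[(0.95, True), (0.9, True), (0.8, False), (0.6, True), (0.3, False)]`, raising the threshold from 0.5 to 0.85 increases precision from 0.75 to 1.0 while recall drops from 1.0 to 2/3, which is the trade-off discussed above.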

In general the metric based on Markov models outperforms
the other metrics. Our Markov-chain method has the
advantage of having the highest precision values,
especially at recalls up to 0.71. Remember that a recall
of 0.71 means that 71% of all matching username couples
have been successfully linked. Depending on the
application, one might favor the TF-IDF based approach
(method 2), which has good precision at higher recalls, or
the Markov-chain approach (method 1), which has the
highest precision up to recall 0.7.

The string metrics (NLD and Jaro) perform surpris- 
ingly well in the task of matching different usernames. 
This is probably because, as shown in Figure 4, non- 
linked usernames tend to have higher mean distances 
between themselves than linked usernames. Both of 
these string-only-metric tools assign a positive weight for 
close strings and normalize it according to the maximum 
length of the strings. Hence, one possible explanation 
of the performance of NLD and Jaro distances is that 
the string length models sufficiently well the surprisal 
of a string for the purpose of username linkage. Indeed, 
Figure 6 shows a scatter plot of the entropy as computed
by our uniqueness metric in comparison to the length of 
the strings. The graph clearly shows a central area of 
correlation between the two metrics and this is reflected 
by a high Pearson correlation between the two samples 
of 0.801. 

5.1.2 Discussion of Results 

Our results show that it is possible, with high precision, 
to link accounts solely based on usernames. This is due 
to the high average entropy of usernames and the fact 
that users tend to choose usernames that are related 
to each other. Clearly users could completely change 
their username for each service they use and, in this 
case, our technique would be rendered useless. However, 
our analysis shows that users indeed choose similar and 
high entropy usernames. This phenomenon can be seen 
as related to the much more studied password reuse 
phenomenon [To] that plagues web services. Users tend 
to reuse a small subset of passwords on 3.9 services 
on average, which can be explained by the difficulty 
of remembering multiple passwords. The same might 
hold true for usernames, with the notable exception that
usernames can be freely written and do not have to be 
securely stored. 

This technique might be used by profilers and advertisers
trying to link multiple online identities together
in order to sharpen their knowledge of the users. By
crawling multiple web services and OSNs (a crawl of
100M Facebook profiles has already been made available
on BitTorrent), profilers could obtain lists of accounts
with their associated usernames. These usernames could
then be used to link the accounts using the techniques
outlined in the previous section.

5.1.3 Addressing Possible Limitations 

The linked username couples we used as ground truth
have been gathered from Google Profiles. We have shown
that, in this sample, users tend to choose related
usernames. However, one might argue that this sample
might not be sufficiently representative of the whole
population. Indeed, Google Profiles users might be less
concerned about privacy and show a preference for being
traceable by posting their information on their Profiles.

We were not able to test our tool in linking profiles
to certain types of web services in which users are more
privacy aware, like dating and medical websites (e.g., 
WebMD). This was due to the difficulty of gathering 
ground truth evidence for this class of services. However, 
even if we assume that users choose completely unre- 
lated usernames for different websites, our tool might 
still be used. In fact, it might be the case that a user 
is registered on multiple dating websites with similar 
usernames. Those profiles might be linked together with 
our tool and more complete information about the user 
might be found. For example, a date of birth on a web- 
site might be linked with a city of residence and a first 
name on another, leading to real world identification. 
We acknowledge that, without evidence, this is only spec- 
ulation and a more thorough analysis is left for future 
work. 

5.1.4 Possible Improvements 

Finding linked usernames in a population requires 
time that is quadratic in the population size, as all 
possible couples must be tested for similarity. This 
might be too costly if one has millions of usernames 
to match. A solution to this problem is to divide the
matching task into two phases. First, divide usernames
into clusters that are likely to be linked. For example, one
could choose usernames that share at least one n-gram, 
thus restricting the number of combinations that need to 
be tried. Second, test all possible combinations within a 
cluster. 
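The first phase can be sketched with an inverted index from n-grams to usernames. This is a minimal illustration under our own assumptions (the name `candidate_pairs` and the choice n = 3 are hypothetical); as stated below, we did not implement or evaluate such an approach.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(usernames, n=3):
    """Return the couples of usernames that share at least one n-gram."""
    index = defaultdict(set)  # n-gram -> usernames containing it
    for u in usernames:
        for i in range(len(u) - n + 1):
            index[u[i:i + n]].add(u)
    pairs = set()
    for bucket in index.values():
        # Only usernames in the same bucket need to be compared in phase two.
        for a, b in combinations(sorted(bucket), 2):
            pairs.add((a, b))
    return pairs
```

Only the returned couples are then passed to the similarity computation. Usernames shorter than n characters produce no n-gram and are therefore never paired in this sketch.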

Another possible improvement is to use a hybrid approach
in which different similarity metrics are combined
to obtain a single similarity [5]. For instance, one could
use different similarity metrics (TF-IDF, Markov, Jaro, 
etc.) to compose a feature vector that can be then clas- 
sified using machine learning techniques like SVMs [fjj. 
Such hybrid approaches are known to perform better in 
the record linkage tasks [5]. However, we did not test 
or implement such approaches and their application to 
linking online identities is left as future work. 

6. RELEVANT USERNAME STATISTICS

This section contains username statistics that com- 
plement the experiments we proposed and justify our 
technique in more practical scenarios. 

How do people choose their username?

We now aim to exploit our Google Profiles dataset
to verify whether people use their real name to compose
pseudonyms as usernames. If this is the case, an
attacker might try to generate likely usernames for a
victim and track the victim on multiple web services
using the techniques explained above to determine
username uniqueness and linkage. Our analysis is based
on first and last names as provided by users in their
profiles. We discard from the original dataset names
provided with strings containing non-Latin characters.
These names cannot be mapped to a username according
to the Google policy, and so we restrict our study to the
Latin alphabet (a-z). For simplicity, we also only
considered names composed of two words (i.e., both first
and last names are provided). After this filtering, we
ended up with 2.6M couples of names and usernames.

We decomposed the name into two words that we refer
to as $w_1$ and $w_2$. According to the Google Profiles
policy, $w_1$ (resp. $w_2$) refers to the first name (resp.
last name). We first performed a preliminary matching using
Perl's regular expressions to check whether usernames
contain a combination of $w_1$, $w_2$ and digits. Results
are shown in Table 1.



| Matching Condition      | # Usernames | Percentage (%) |
|-------------------------|-------------|----------------|
| $w_1$ and $w_2$ and d   | 207K        | 7.93           |
| $w_1$ and $w_2$         | 774K        | 29.63          |
| $w_1$ and d             | 142K        | 5.47           |
| $w_2$ and d             | 132K        | 5.06           |
| $w_1$                   | 241K        | 9.24           |
| $w_2$                   | 323K        | 12.38          |
| Not matching            | 792K        | 30.3           |
| Total                   | 2.6M        | 100            |

Table 1: Usernames construction analysis matching
first/last name of users: the first name ($w_1$)
and/or last name ($w_2$) and digits (d).

The matching conditions are exclusive, following the
order as presented in the table (e.g., a username matching
$w_1$ and $w_2$ is counted in the second row and not in the
$w_1$ and $w_2$ rows separately). One of the most remarkable
results is that 70% of the usernames contain at least one
of the two parts of the real name. In particular, 30%
of the collected usernames are constructed from both
the first and last name without adding any digit. We also
observe that more than 18% of the usernames are
constructed by adding digits to the provided first and/or
last names. This is most likely a typical behavior of
users whose first chosen pseudonym (a variant of first
and last name) is not available and who add digits
(e.g., birth year) to be able to register with the service.
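The exclusive ordering of the matching conditions can be illustrated with a short Python sketch. The function name is hypothetical, and the case-insensitive substring test is our assumption about what "contains" means; the original analysis relied on Perl regular expressions whose exact form is not reproduced here.

```python
import re


def matching_condition(username, first, last):
    """Assign a username to exactly one row of Table 1, testing rows in table order.

    ASSUMPTION: first and last are non-empty and matching is case-insensitive.
    """
    u, w1, w2 = username.lower(), first.lower(), last.lower()
    has_w1, has_w2 = w1 in u, w2 in u
    has_d = re.search(r"\d", u) is not None
    if has_w1 and has_w2:
        return "w1 and w2 and d" if has_d else "w1 and w2"
    if has_w1:
        return "w1 and d" if has_d else "w1"
    if has_w2:
        return "w2 and d" if has_d else "w2"
    return "not matching"
```

Because the tests are applied from the most specific condition to the least specific one, a username such as john.smith84 is counted once, in the first row, and not again in the rows for the individual names.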

After this preliminary analysis we want to understand
how $w_1$ and $w_2$ are combined to build the exact username.
In order to do that, we also consider the first character
of each word, namely $c_1$ and $c_2$ respectively. Table 2
shows multiple ways to combine $w_1$, $w_2$, $c_1$, $c_2$ and
digits (d). We provide the percentage of usernames observed
for each combination. The results show that more than
50% of usernames match exactly the patterns we tested.
One can observe that the most common way usernames
are generated from users' real names is by concatenating
the first and last name, in that specific order (almost
14%), or by adding a dot between both names (13%).
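To make the patterns concrete, the following Python sketch generates candidate usernames for a given first and last name, ordered by the frequencies we observed. It covers only a handful of the tested patterns; the function name and the way digits are supplied are hypothetical choices made for the illustration.

```python
def candidate_usernames(first, last, digits=""):
    """Build likely usernames from a real name, most frequent pattern first.

    Patterns that end with digits are produced only when digits is non-empty.
    """
    w1, w2 = first.lower(), last.lower()
    c1 = w1[0]  # first character of the first name
    candidates = [
        (w1 + w2, False),                # w1w2   (13.99%)
        (w1 + "." + w2, False),          # w1.w2  (12.95%)
        (w1 + w2 + digits, True),        # w1w2d  (6.04%)
        (c1 + w2, False),                # c1w2   (3.45%)
        (w1 + digits, True),             # w1d    (2.71%)
        (w2 + "." + w1, False),          # w2.w1  (2.22%)
        (w1 + "." + w2 + digits, True),  # w1.w2d (1.86%)
        (w2 + w1, False),                # w2w1   (1.46%)
    ]
    return [name for name, needs_digits in candidates if digits or not needs_digits]
```

Such a list could then be fed to the uniqueness and linkage techniques discussed earlier to decide which candidates are worth following up.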



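| Pattern     | %     | Pattern     | %    | Pattern       | %    |
|-------------|-------|-------------|------|---------------|------|
| $w_1 w_2$   | 13.99 | $c_1.w_2$   | 0.44 | $w_2 d$       | 0.84 |
| $w_2 w_1$   | 1.46  | $w_2 c_1$   | 0.28 | $w_1 w_2 d$   | 6.04 |
| $w_1.w_2$   | 12.95 | $w_2.c_1$   | 0.08 | $w_1.w_2 d$   | 1.86 |
| $w_2.w_1$   | 2.22  | $c_2 w_1$   | 0.09 | $w_2 w_1 d$   | 0.89 |
| $w_1$       | 0.91  | $w_1 c_2$   | 0.44 | $w_2.w_1 d$   | 0.49 |
| $w_2$       | 0.99  | $w_1.c_2$   | 0.09 | $c_1.w_2 d$   | 2.58 |
| $c_1 w_2$   | 3.45  | $w_1 d$     | 2.71 | $w_1 c_2 d$   | 0.8  |

Table 2: Usernames exactly matching a pattern.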

Again, because first-chosen usernames might be already
in use, users typically choose to (or are suggested
to by the online service itself) add a number as a postfix
of their desired username. In particular, we observe that
in most of these cases users add exactly two or four
digits, in 40% and 20% of the cases respectively. These
ending digits then suggest either the year of birth (full
or simply the last two digits) or the birth date.

Finally, Figure 7 shows the distribution of the number
of different usernames the users in our dataset utilize.
The graph shows that most users have two or three
different usernames. The mean number of usernames
per user is 2.3.

Figure 7: Number of distinct usernames per user
distribution

7. DISCUSSION 

Recently, some governments and institutions have been
trying to pass laws and policies to force users to tie
their digital identities to their real ones. For example,
there is a current discussion in China and France [1] on
laws that would require users to use their real names when
posting comments on blogs and forums. Similarly, the
company Blizzard started an effort to tie real identities
to the ones used to post comments on its video game forums.

This work shows that it is clearly possible to tie digital 
identities together and, most likely, to real identities in 
many cases only using ubiquitous usernames. We also 
showed that, even though users are free to change their 
usernames at will, they do not do it frequently and, when 
they do, it is in a predictable way. Our technique might 
then be used as an additional tool when investigating 
online crime. It is however also subject to abuse and 
could result in breaches in privacy. Advertisers could 
automatically build online profiles of users with high 
accuracy and minimum effort, without the consent of 
the users involved. 

Spammers could gather information across the web 
to send extremely targeted spam, which we dub E-mail 
spam 2.0. For example, by matching a Google profile 
and an eBay account one could send spam emails that 
mention a recent sale. In fact, while eBay profiles do 
not show much personal information (like real names) 
they do show recent transactions indexed by username. 
This would enable very targeted and efficient phishing 
attacks. We argue that these targeted attacks might 
have higher click rates for spammers, thus leading to
smaller spam campaigns that would be much harder for
spam classifiers to recognize. 

Finally, users could use our tool to assess how unique
and linkable their usernames are. They can thus make an
informed decision on whether to change their pseudonym
for the online activities they wish to remain private.
Paradoxically, it would be difficult for a user who
decides to prevent the linking of her different usernames
(particularly on OSNs) to choose usernames that are
unlinkable without losing some of the benefits of the
various features of OSNs.

In the light of our results, an analysis of the nature
and anonymity of usernames is needed. Historically,
usernames have been used to identify users in small
groups; one such example is Unix usernames. In groups
of dozens or a few hundred people, usernames naturally
tend to be non-identifying and non-unique (see footnote 4).
As online communities grow in size, so does the entropy
of the usernames. Nowadays users are forced to choose
usernames that have to be unique in online services that
have hundreds of millions of users. Naturally, users had to
adapt and choose higher-entropy usernames to be able
to find usernames that were not already assigned. This
can allow for privacy breaches.

7.1 Countermeasures 

On the user side. 

Following this work, users might change their username
habits and use different usernames on different web
services. We released our tool as a web application that
users can access to estimate how unique their username is
and thus make an informed decision on the need to change
their usernames when they deem appropriate
(http://planete.inrialpes.fr/projects/how-unique-are-your-usernames).

For web services. 

There are two main features that make our technique 
possible and exploitable in real case scenarios. First, 
web services and OSNs allow access to public accounts 
of their users via their usernames. This can be used to 
easily check for existence of a given username and to 
automatically gather information. Some web services like 
Twitter are built around this particular feature. Second, 
web services usually allow the user pages to be crawled 
automatically. While in some cases this might be a 
necessary evil to allow search engines to access relevant 
content, in many instances there is no legitimate use of 
this technique and indeed some OSNs explicitly forbid 
it in the terms of service agreements, e.g., Facebook. 

While preventing automatic abuse of public content 
can be difficult in general, for example when the attacker 
has access to a large number of IPs, it is possible to at 
least throttle access to those resources via CAPTCHAs 
[20] or similar techniques. For example, in our study we 
discovered that eBay presents users with a CAPTCHA
if too many requests are directed to their servers from
the same IP.

Footnote 4: We compared the mean entropy of the usernames
on a shared server in our lab with the ones gathered from
Google and the difference is remarkable.

8. CONCLUSION 

In this paper we introduced the problem of linking 
online profiles using only usernames. Our technique 
has the advantage of being almost always applicable 
since most web services do not keep usernames secret. 
Two families of techniques were introduced. The first one
estimates the uniqueness of a username to link profiles
that have the same username. We borrow from language
model theory and Markov-chain techniques to estimate
uniqueness. Usernames gathered from multiple services
have been shown to have a high entropy and therefore 
might be easily traceable. 

We extend this technique to cope with profiles that are 
linked but have different usernames and tie our problem 
to the well known problem of record linkage. All the 
methods we tried have high precision in linking username 
couples that belong to the same users. 

Ultimately we show a new class of profiling techniques 
that can be exploited to link together and abuse the 
public information stored on online social networks and 
web services in general. 
APPENDIX

Username uniqueness from a probabilistic
point of view

We now focus on computing the probability that only
one user has chosen username u in a population. We
refer to this probability as $P_{uniq}(u)$.

Intuitively, $P_{uniq}(u)$ should increase as the
likelihood P(u) decreases. However, $P_{uniq}(u)$ also
depends on the size of the population in which we are
trying to estimate uniqueness. For example, consider the
case of first names. Even an uncommon first name does not
uniquely identify a person in a very large population, e.g.,
the US. However, it is very likely to uniquely identify a
person in a smaller population, like a classroom.

To achieve this goal we use P(u) to calculate
the expected number of users in the population that
would likely choose username u. Let us denote by n(u)
the expected number of users that choose string u as a
username in a given population of size W. The value of
n(u) is calculated based on P(u) as:

$$n(u) = P(u) \cdot W$$

where W is the total number of users in the population.
In our case W is an estimate of the number of users
on the Internet: 1.93 billion (see footnote 5).

Footnote 5: http://www.internetworldstats.com/stats.htm

In case we are sure there exists at least one user that
selected the username u (because u is taken on some web
service), the computation of n(u) changes slightly:

$$n(u) = P(u) \cdot (W - 1) + P(u \mid u) = P(u) \cdot (W - 1) + 1$$

where the addition of 1 comes from the fact that we are
sure that there exists at least one user who chose u, and
W - 1 accounts for the person of whom we are sure.

Finally we can estimate the uniqueness of a username
u by simply considering the probability that our user is
unique in the reference set determined by n(u), hence:
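The computation of the expected number of users n(u) from P(u) and the population size W can be sketched in Python as follows. The function names are ours. The last function rests on an assumption: it reads "the probability that our user is unique in the reference set determined by n(u)" as 1/n(u), which is one natural interpretation and is labeled as such in the code.

```python
def expected_users(p_u, population, known_taken=False):
    """Expected number of users n(u) choosing username u.

    p_u: probability P(u) that a user picks u; population: W.
    If u is known to be taken, one user is certain and the remaining
    W - 1 users each pick u with probability P(u).
    """
    if known_taken:
        return p_u * (population - 1) + 1
    return p_u * population


def uniqueness(p_u, population):
    """ASSUMPTION: uniqueness is taken as 1 / n(u) for a username known to exist."""
    return 1.0 / expected_users(p_u, population, known_taken=True)
```

For example, with W = 1.93 billion and a hypothetical P(u) of one in a billion, n(u) is about 2.93 for a username that is known to be taken, so under the assumption above the user would be unique with probability of roughly one third.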



